Detecting Vital Documents in Massive Data Streams
Authors
Abstract
Existing knowledge bases, including Wikipedia, are typically written and maintained by groups of volunteer editors. Meanwhile, large numbers of web documents are being published, partly because of the spread of online news and social media. Some of these web documents, called “vital documents”, contain novel information that should be taken into account when updating knowledge base articles. However, it is practically impossible for the editors to manually monitor all the relevant web documents. Consequently, there is a considerable time lag between the publication of such vital documents and the corresponding edits to the knowledge base. This paper proposes a real-time framework for detecting web documents that contain novel information in massive document streams. The framework consists of a two-step filter using statistical language models. Further, the framework is implemented on Apache Storm, a distributed and fault-tolerant real-time computation system, in order to process the large number of web documents. On a publicly available web document data set, the TREC KBA Stream Corpus, the validity of the proposed framework is demonstrated in terms of detection performance and processing time.
Similar resources
Spatial Semantic Scan: Detecting Subtle, Spatially Localized Events in Text Streams
Many methods have been proposed for detecting emerging events in text streams using topic modeling. However, these methods have shortcomings that make them unsuitable for rapid detection of locally emerging events on massive text streams. We describe Spatially Compact Semantic Scan (SCSS) that has been developed specifically to overcome the shortcomings of current methods in detecting new spati...
Incremental View Maintenance for Active Documents
In this paper, we develop algorithmic datalog-based foundations for the incremental processing of tree-pattern queries over active documents, i.e. documents with incoming streams of data. We define query satisfiability for such documents based on a logic with 3 values: true, false forever, and false for now. Also, given an active document and a query, part of the document (and in particular, some...
Semantic Scan: Detecting Subtle, Spatially Localized Events in Text Streams
Early detection and precise characterization of emerging topics in text streams can be highly useful in applications such as timely and targeted public health interventions and discovering evolving regional business trends. Many methods have been proposed for detecting emerging events in text streams using topic modeling. However, these methods have numerous shortcomings that make them unsuitab...
Detecting Anomalies in Massive Traffic Streams Based on S-Transform Analysis of Summarized Traffic Entropies
Detecting traffic anomalies is an indispensable component of overall security architecture. As Internet and traffic data with more sophisticated attacks grow exponentially, preserving security with signature-based traffic analyzers or analyzers that do not support massive traffic is not sufficient. In this paper, we propose a novel method based on combined sketch technique and S-transform analy...
Detecting Website Redesigns via Template Similarity on Streams of Documents
Most websites undergo a redesign from time to time. Along with the change of the appearance of the site comes a different document structure. Hence, redesigns can be detected by observing changes in the structural similarity of monitored HTML documents. Assuming further to monitor not a fixed document set but a series of the newest documents (e.g. provided by an RSS feed) transforms the task of...
Journal title:
- OJWT
Volume 2, Issue
Pages -
Publication date 2015